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Abstract 

This paper presents the design and analysis of paraUel approximation algorithms for facility- 
location problems, including NC and RNC algorithms for (metric) facility location, fc-center, fc-mcdian, 
and fc-means. These problems have received considerable attention during the past decades from 
the approximation algorithms community, concentrating primarily on improving the approximation 
guarantees. In this paper, we ask, is it possible to parallelize some of the beautiful results from the 
sequential setting? 

Our starting point is a small, but diverse, subset of results in approximation algorithms for 
facility-location problems, with a primary goal of developing techniques for devising their efficient 
parallel counterparts. We focus on giving algorithms with low depth, near work efficiency (compared 
to the sequential versions), and low cache complexity. Common in algorithms we present is the 
idea that instead of picking only the most cost-effective element, we make room for parallelism by 
allowing a small slack (e.g., a (1 + e) factor) in what can be selected — then, we use a clean-up step 
to ensure that the behavior does not deviate too much from the sequential steps. In this paper, 
we first present a parallel RNC algorithm mimicking the greedy algorithm of Jain et al. (J. ACM, 
50(6):795-824, 2003.). This is the most challenging algorithm to parallelize because the greedy 
algorithm is inherently sequential. We show the algorithm gives a (3.722 -I- £)-approximation and 
does 0{m log^_|_g m) work, which is within a logarithmic factor of the serial algorithm. Then, we 
present a simple RNC algorithm using the primal-dual approach of Jain and Vazirani (J. ACM, 
48(2). -274-296, 2001.), which leads to a (3 -I- e)-approximation, and for input of size m runs in 
0{m\ogij^^m) work, which is the same as the sequential work. The sequential algorithm is a 3- 
approximation. Following that, we present a local-search algorithm for fc-median and A:-means, with 
approximation factors of 5 -|- e and 81 -I- e, matching the guarantees of the sequential algorithms. For 
constant k, the algorithm does 0(n^ log n) work, which is the same as the sequential counterpart. 
Furthermore, we present a 2-approximation algorithm for fc-center with 0((n log n)^) work, based 
on the algorithm of Hochbaum and Shmoys {Math. OR, 10(2):180~184, 1985.). Finally, we show a 
0(m log2+£(m))-work randomized rounding algorithm, which yields a (4-t-£)-approximation, given an 
optimal linear-program solution as input. The last two algorithms run in work within a logarithmic 
factor of the serial algorithm counterparts. All these algorithms arc "cache efficient" in that the 
cache complexity is bounded by 0{w/B), where w is the work in the EREW model and B is the block 
size. 
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1 Introduction 



Facility location is an important and well-studied class of problems in approximation algorithms, with 
far-reaching implications in areas as diverse as machine learning, operations research, and networking: 
the popular fc-means clustering and many network-design problems are all examples of problems in 
this class. Not only are these problems important because of their practical value, but they appeal 
to study because of their special stature as "testbeds" for techniques in approximation algorithms. 
Recent research has focused primarily on improving the approximation guarantee, producing a series 
of beautiful results, some of which are highly efficient — often, with the sequential running time within 
constant or poly logarithmic factors of the input size. 

Despite significant progress on these fronts, work on developing parallel approximation algorithms for 
these problems remains virtually non-existent. Although variants of these problems have been considered 
in the the distributed computing setting ||MW05| , GLS06 , PP09], to the best our of knowledge, almost 



no prior work has looked directly in the parallel setting where the total work and parallel time (depth) 
are the parameters of concern. The only prior work on these problems is due to Wang and Cheng, who 
gave a 2- approximation algorithm for fc-center that runs in O(nlog^n) depth and 0{n^) work |WC9C], 
a result which we improve in this paper. 

Deriving parallel algorithms for facility location problems is a non-trivial task and will be a valuable step 
in understanding how common techniques in approximation algorithms can be parallelized efficiently. 
Previous work on facility location commonly relies on techniques such as linear-program (LP) rounding, 
local search, primal dual, and greedy. Unfortunately, LP rounding relies on solving a class of linear 
programs not known to be solvable efficiently in polylogarithmic time. Neither do known techniques 
allow for parallelizing local-search algorithms. Despite some success in parallelizing primal-dual and 
greedy algorithms for set-covering, vertex-covering, and related problems, these algorithm are obtained 
using problem-specific techniques, which are not readily applicable to other problems. 

1.1 Summary of Results. 

In this paper, we design and analyze several algorithms for (metric) facility location, fc-median, fc-means 
and A;-center problems, focusing on parallelizing a diverse set of techniques in approximation algorithms. 



We study the algorithms on the EREW PRAM and the Parallel Cache Oblivious model pGS10| . The 
latter model captures memory locality. We are primarily concerned with minimizing the work (or 
cache complexity) while achieving polylogarithmic depth in these models. We are less concerned with 
polylogarithmic factors in the depth since such measures are not robust across models. By work, we mean 
the total operation count. All algorithms we develop are in NC or RNC, so they have polylogarithmic 
depth. 



We first present a parallel RNC algorithm mimicking the greedy algorithm of Jain et al. | JMM"'"0"3| . This 



is the most challenging algorithm to parallelize because the greedy algorithm is inherently sequential. 
We show the algorithm gives a (3.722 + e)-approximation and does 0(mlogf_^£ m) work, which is within 
a logarithmic factor of the serial algorithm. Then, we present a simple RNC algorithm using the primal- 



dual approach of Jain and Vazirani | JV01 | which leads to a (3 + e)-approximation and for input of size 
m runs in O(mlog]^_|_£ m) work, which is the same as the sequential work. The sequential algorithm 
is a 3-approximation. Following that, we present a local-search algorithm for A;- median and A;- means, 
with approximation factors of 5 + e and 81 + e, matching the guarantees of the sequential algorithms. 
For constant k, the algorithm does O(ri^logn) work, which is the same as the sequential counterpart. 
Furthermore, we present a 2-approximation algorithm for A;-center with 0((n log n)^) work, based on 
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the algorithm of Hochbaum and Shmoys [HS85| ] . Fmally, we show a 0(?n logf_|_£(m))-work randomized 
rounding algorithm, which yields a (4 + e)-approximation, given an optimal linear-program solution 
as input. The last two algorithms run in work within a logarithmic factor of the serial algorithm 
counterparts. 

1.2 Related Work 

Facility-location problems have had a long history. Because of space consideration, we mention only 
some of the results here, focusing on those concerning metric instances. For the (uncapacitated) metric 
facility location, the first constant factor approximation was given by Shmoys et al. [pTAQTj , using an 



LP-rounding technique, which has subsequently been improved [Chu98, GK99|. A different approach. 



based on local-search techniques, has been used to obtain a 3-approximation | KPROO , AGK^04 , GT08|. 
Combinatorial algorithms based on primal-dual and greedy approaches with constant approximation 
factors are also known | JMM"'"0"3 | ^ JVOl, PT03]. Other approximation algorithms and hardness results 
have also been given by |Svi02| , |CS03| , |Byr07l , |CG05| , |MMSV01| , |MYZ02| , |KPROO| , |CGO^, |GK9l. An 



open problem is to close the gap between the best known approximation factor of 1.5 [Byr07| and the 



hardness result of 1.463 |GK99|] . 

The first constant factor approximation for fc-median problem was given by Charikar et al. |CGTS02], 
which was subsequently improved by CG05 | and | AGK"'"04 ] to the current best factor of 3 + e. For 
fe-means, constant-factor approximations are known for this problem [JVOl, pT08 ]; a special case when 
the metric space is the Euclidean space has also been studied [ KMN^d4 |. For A;-center, tight bounds 
are known: there is a 2-approximation algorithm due to [Gon85, HS86|, and this is tight unless P = NP. 

The study of parallel approximation algorithms has been slow since the early 1990s. There are RNC 
and NC parallel approximation algorithms for set cover [BRS8E, RV98[| , vertex cover [KVY94, KY09], 
special cases of linear programs (e.g., positive LPs and cover-packing LPs) | |LN93 , 3ri01, You01|, and 
/c-center |WC90|. These algorithms are typically based on parallelizing their sequential counterparts, 
which usually contain an inherently sequential component (e.g., a greedy step which requires picking 
and processing the minimum-cost element before proceeding to the next). A common idea in these 
parallel algorithms is that instead of picking only the most cost-effective element, they make room for 
parallelism by allowing a small slack (e.g., a (1 + e) factor) in what can be selected. This idea often 
results in a slightly worse approximation factor than the sequential version. For instance, the parallel 
set-cover algorithm of Rajagopalan and Vazirani is a (2(1 -|-e) In n)-approximation, compared to a (Inn)- 
approximation produced by the standard greedy set cover. Likewise, the parallel vertex-cover algorithm 
of Khuller et al. is a 2/(1 — e)-approximation as opposed to the optimal 2-approximation given by 
various known sequential algorithms. Only recently has the approximation factor for vertex cover been 
improved to 2 in the parallel case [ KY09| ] . 



Several approximation algorithms have been proposed for distributed computing; see, e.g. ElkO^ ] , for a 
survey. For facility location, recent research has proposed a number of algorithms, both for the metric 



and non- metric cases [ MW05 , GLS06 , PP09 |. The work of Pandit and Pemmaraju [ PP09 is closely 
related our primal-dual algorithm; their algorithm is a 7-approximation in the CONGEST model for 
distributed computing. Both their algorithm and ours have a similar preprocessing step and rely on 
the (1 -|-e)-slack idea although their algorithm uses a fixed e = 1. The model and the efficiency metrics 
studied are different, however. 
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2 Preliminaries and Notation 



Let F denote a set of facilities and C denote a set of clients. For convenience, let ric = |C|, nj = \F\, and 
m = Uc X rif. Each facility i ^ F has a cost of /j, and each client j € C incurs a cost ("distance") i) 
to use the facility i. We assume throughout that there is a metric space {X, d) with F U C C X that 
underlies our problem instances. Thus, the distance d is symmetric and satisfies the triangle inequality. 
As a shorthand, denote the cost of the optimal solution by opt, the facility set of the optimal solution 
by F* , and the facility set produced by our algorithm by Fa- Furthermore, we write d{u,S) to mean 
the minimum distance from u to a member of S, i.e., d{u, S) = min{(i(n, w) : w € S}. 

Let G be a graph. We denote by degQ^v) the degree of the node v in G and use Tg{v) to denote the 
neighbor set of the node v. We drop the subscript (i.e., writing deg('y) and T{v)) when the context is 
clear. Let V{G) and E{G) denote respectively the set of nodes and the set of edges. 

Parallel Models. All the parallel algorithms in this paper can be expressed in terms of a set of simple 
operations on vectors and dense matrices, making it easy to analyze costs on a variety of parallel models. 
In particular, the distances d{-, •) can be represented as a dense n x n matrix, where n = ric + Uf, and 
any data at clients or facilities can be represented as vectors. The only operations we need are parallel 
loops over the elements of the vector or matrix, transposing the matrix, sorting the rows of a matrix, 
and summation, prefix sums and distribution across the rows or columns of a matrix or vector. A prefix 
sum returns to each element of a sequence the sum of previous elements. The summation or prefix sum 
needs to be applied using a variety of associative operators, including min, max, and addition. 

We refer to all the operations other than sorting as the basic matrix operation. The basic matrix 
operations on m elements can all be implemented with 0{m) work and O(logm) time on the EREW 



PRAM [JaJ92], and with 0[m/B) cache complexity and O(logm) depth in the parallel cache oblivious 
model. For the parallel cache oblivious model, we assume a tall cache M > B^, where M is the size 
of the cache and B is the block size. Sorting m elements takes 0(m log m) work and O(logm) time on 



an EREW PRAM | Col88|J, and 0(^ log^.//^ m) cache complexity and 0(log m) depth on the parallel 



cache oblivious model [ BGS10| . All algorithms described in this paper are cache efficient in the sense 



that the cache complexity in the cache oblivious model is bounded by 0{w/B) where w is the work in 
the EREW model. All algorithms use a polylogarithmic number of calls to the basic matrix operations 
and sorting and are thus in RNC — do polynomial work with polylogarithmic depth and possibly use 
randomization. 

Given this set up, the problems considered in this paper can be defined as follows: 

(Metric) Facility Location. The goal of this problem is to find a set of facilities Fs F that 
minimizes the objective function 

FACL0C(F5) =J2fi + Yl ^(J' (1) 

ii=Fs jec 

Note that we do not need an explicit client-to- facility assignment because given a set of facilities Fs, 
the cost is minimized by assigning each client to the closest open facility. 

Non-trivial upper- and lower-bounds for the cost of the optimal solution are useful objects in approxima- 
tion algorithms. For each client j E G, let 7^ = minjgi?(/j -|- d(j,i)) and 7 = maxj^clj- The following 
bounds can be easily established: 

7 < opt < ^7j < 7?!^. (2) 
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Furthermore, metric facility location has a natural integer-program formulation for which the relaxation 
yields the pair of primal and dual programs shown in Figure ||. 



Minimize EieFjGC ^0'' + HieF fiVi Maximize X^j^c 

E.eF^u > 1 forjeC r Y.,ecN < h for z G F 

Subj. to: { yi~ > for i e F,j e C Subj. < aj - I3^j < d{j,i) for i e F,j e C 

Xrj > 0, ij, > to: [ |3^■i > 0, a-i > 



Figure 1: The primal (left) and dual (right) programs for metric (uncapacitated) facility location. 

A;-Median and /c-Means. Unlike facility location, the fc-median objective does not take into consider- 
ation facility costs, instead limiting the number of opened centers (facilities) to k. Moreover, in these 
problems, we typically do not distinguish between facilities and clients; every node is a client, and every 
node can be a facility. Formally, let y C X be the set of nodes, and the goal is to find a set of at 
most k centers Fs QV that minimizes the objective kM.ed{Fs) = '^j^y d{j, Fs). Almost identical to 
/c-median is the /c-means problem with the objective kMeans(-F5) = X^jgc* d^(j, -^5)- 

A;-Center. Another type of facility-location problem which has a hard limit on the number of facilities 
to open is /c-center. The /^-center problem is to find a set of at most k centers Fs QV that minimizes 
the objective kCenter(Fs) = maxj^zv d{j, Fs). 

In these problems, we will use n to denote the size of V. 

3 Dominator Set 

We introduce and study two variants of the maximal independent set (MIS) problem, which will prove 
to be useful in nearly all algorithms described in this work. The first variant, called the dominator set 
problem, concerns finding a maximal set I Q V of nodes from a simple graph G = {V, E) such that 
none of these nodes share a common neighbor (neighboring nodes of G cannot both be selected). The 
second variant, called the U -dominator set problem, involves finding a maximal set / C [/ of the [/-side 
nodes of a bipartite graph H = {U, V, E) such that none of the nodes have a common y-side neighbor. 
We denote by MaxDom(G) and MAxUDoM(i7) the solutions to these problems, resp. 

Both variants can be equivalently formulated in terms of maximal independent set. The first variant 
amounts to finding a maximal independent set on 

G^ = {V, {uw : uw & E or 3z s.t. uz, zw G E}), 

and the second variant a maximal independent set on 

H' = {U, {uw -.BzeV s.t. uz, zw G E}). 

Because of this relationship, on the surface, it may seem that one could simply compute G^ or H' 
and run an existing MIS algorithm. Unfortunately, computing graphs such as G^ and H' appears to 
need 0{n'^) work, where cj is the matrix- multiply constant, whereas the naive greedy- like sequential 
algorithms for the same problems run in 0{\E\) = 0{n^). This difference makes it unlikely to obtain 
work efficient algorithms via this route. 
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In this section, we develop near work-efficient algorithms for these problems, bypassing the construction 
of the intermediate graphs. The key idea is to compute a maximal independent set in-place. Numerous 
parallel algorithms are known for maximal independent set, but the most relevant to us is an algorithm 



of Luby [ Lub86 |, which we now sketch. 

The input to the algorithm is a graph G = {y,E). Luby's algorithm constructs a maximal independent 
set / C y by proceeding in multiple rounds, with each round performing the following computation: 

Algorithm 3.1 The select step of Luby's algorithm for maximal independent set. 

1. For each i G y, in parallel, 7r(z) = a number chosen u.a.r. from {1, 2, . . . , 2n^}. 

2. Include a node i in the maximal independent set I if 7r(i) < min{7r(j) : j € r(i)}, where T{i) is 
the neighborhood of i in G. 



This process is termed the select step in Luby's work. Following the select step, the newly selected 
nodes, together with their neighbors, are removed from the graph before moving on to the next round. 

Implementing the select step: We describe how the select step can be performed in-place for the 
first variant; the technique applies to the other variant. We will be simulating running Luby's algorithm 
on G^, without generating G^. Since G^ has the same node set as G, step 1 of Algorithm 3.1 



remains 

unchanged. Thus, the crucial computation for the select step is to determine efficiently, for each node i, 
whether 7r(i) holds the smallest number among its neighbors in G^, i.e., computing efficiently the test 
in step 2. To accomplish this, we simply pass the 7r(i) to their neighbors taking a minimum, and then 
to the neighbors again taking a minimum. These can be implemented with a constant number of basic 
matrix operations, in particular distribution and summation using minimum over the rows and columns 
of the |yp matrix. 

Lemma 3.1 Given a graph G = (V^E), a maximal dominator set I Q V can be found in expected 
0{log'^ \V\) depth and 0(|yplog|y|) work. Furthermore, given a bipartite graph G = {U,V,E), a 
maximal U -dominator set I QU can be found in expected 0((log \U\) ■ maxjlog |f/|,log \V\}) depth and 
0{\V\\U\ maxjlog |C/|,log \V\}) work. 

For sparse matrices, which we do not use in this paper, this can easily be improved to 0{\E\ log \V\) 
work. 



4 Facility Location: Greedy 

The greedy scheme underlies an exceptionally simple algorithm for facility location, due to Jain et 
al. [ [IMM+O"^ . Despite the simplicity, the algorithm offers one of the best known approximation guar- 
antees for the problem. To describe the algorithm, we will need some definitions. 

Definition 4.1 (Star, Price, and Maximal Star) A star S = {i,G') consists of a facility i and a 
subset C' C G. The price of S is price(5) = (/j -|- X^jeC <^0'' A star S is said to be maximal 
if all strict super sets of G' have a larger price, i.e., for all C" 2 C , price((i, G")) > price((i, G')). 

The greedy algorithm of Jain et al. proceeds as follows: 



Until no client remains, pick the cheapest star (i, G'), open the facility i, set fi = 0, remove 
all clients in G' from the instance, and repeat. 
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This algorithm has a sequential running time of 0(m log m) and using techniques known as factor- 
revealing LP, Jain et al. show that the algorithm has an approximation factor of 1.861 |JMM^03]. From 



a parallelization point of view, the algorithm is highly sequential — at each step, only the minimum-cost 
option is chosen, and every subsequent step depends on the preceding one. In this section, we describe 
how to overcome this sequential nature and obtain an RNC algorithm inspired by the greedy algorithm 
of Jain et al. We show that the parallel algorithm is a (3.722 -t- e)-approximation. 

The key idea to parallelization is that much faster progress will be made if we allow a small slack in 
what can be selected in each round; however, a subselection step is necessary to ensure that facility and 
connection costs are properly accounted for. 

Algorithm 4.1 Parallel greedy algorithm for metric facility location. 

In rounds, the algorithm performs the following steps until no client remains: 

1. For each facility i, in parallel, compute Si = {i,C^^^), the lowest-priced maximal star centered 
at i. 

2. Let T = miujgi? price(5j), and let I = {i e F : price(5j) < t(1 + e)}. 

3. Construct a bipartite graph H = {I,C',{ij : d{i,j) < t(1 + e)}), where C = {j £ C : 3i G 

Is.t. d{i,j) <t{1+£)}. 

4. Facility Subselection: while (/ / 0): 

(a) Let n : / — 7- {1, . . . , |/|} be a random permutation of /. 

(b) For each j £ C' , let ipj = argminjgp/fO) n(i). 

(c) For each i € /, if \{j = i}\ > 2(i+e) d6g(«), add i to Fa (open i), set fi = 0, remove i from 
/ , and remove from both C and C . 

Note: In the analysis, the clients removed in this step have ttj set as follows. If the facility 
ifj is opened, let VTj = ipj; otherwise, iTj is set to any facility i we open in this step such that 
ij € E{H). Note that any facility that is opened is at least 1/(2(1 paid for by the clients 

that select it, and that since every client is assigned to at most one facility, they only pay for 
one edge. 

(d) Remove i € I (and the incident edges) from the graph H if on the remaining graph, 
— ^dcg{i)^ ^ ^ ~^ '^l^^se facilities will show up in the next round (outer-loop). 

Note: After fi is set to 0, facility i will still show up in the next round. 



We present the parallel algorithm in Algorithm 1.1 and now describe step 1 in greater detail; steps 2-3 
can be implemented using standard techniques [ JaJ92| , Lei92 |. As observed in Jain et al. lJMM+0311 (see 



also Fact |4.2| ), for each facility i, the lowest-priced star centered at i consists of the Ki closest clients to i, 
for some Kj. Following this observation, we can presort the distance between facilities and clients for each 
facility. Let i be a facility and assume without loss of generality that d{i, 1) < d{i,2) < • • • < d{i,nc). 
Then, the cheapest maximal star for this facility can be found as follows. Using prefix sum, compute 
the sequence p^*) = {{fi + '^j<kd{i,k))/k}^'Li. Then, find the smallest index k such that < 
or use = ric if no such index exists. It is easy to see that the maximal lowest-priced star centered at i 
is the facility i together with the client set {1, . . . ,k}. 

Crucial to this algorithm is a subselection step, which ensures that every facility and the clients that 
connect to it are adequately accounted for in the dual-fitting analysis. This subselection process can be 
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seen as scaling back on the aggressiveness of opening up the facilities, mimicking the greedy algorithm's 
behavior more closely. 

4.1 Analysis 

We present a dual-fitting analysis of the above algorithm. The analysis relies on the client-to-facility 
assignment vr, defined in the description of the algorithm. The following easy-to-check facts will be 
useful in the analysis. 

Fact 4.2 For each iteration of the execution, the following holds: (1) If Si is the cheapest maximal 
star centered at i, then j appears in Si if and only if d{j,i) < price(<Si). (2) If t = price(5i), then 
Yljec max(0, t - d{j, i)) = fi. 

Now consider the dual program in Figure ||. For each client j, set aj to be the r setting in the iteration 
that the client was removed. We begin the analysis by relating the cost of the solution that the algorithm 
outputs to the cost of the dual program. 

Lemma 4.3 The cost of the algorithm's solution X^jgp'^ fi + X^jgc -^a) is upper-bounded by 2(1 -|- 

Proof: Consider that in step 4(c), a facility i is opened if at least a 2(i+e) fraction of the neighbors 
"chose" i. Furthermore, we know from the definition of H that, in that round, fi + Ejgri/(i) d{j,i) < 
r(l + e)deg(i). By noting that we can partition C by which facility the client is assigned to in the 
assignment vr, we establish 

jeC i€FA j.Trj=i 

i&FA jec 

as desired. ■ 

In the series of claims that follows, we show that when scaled down by a factor of 7 = 1.861, the a 
setting determined above is a dual feasible solution. We will assume without loss of generality that 
ai < "2 < • • • < anc- Let Wi = {j e C : aj > 'j ■ d{j, i)} for all i e F and W = UiWi. 

Claim 4.4 For any facility i £ F and client jo £ C , 

^ max(0,ajo - (i(j,z)) < fi. 

Proof: Suppose for a contradiction that there exist client j and facility i such that the inequality in 
the claim does not hold. That is, 

^ max(0,aj(, - (i(j,i)) > fi. (3) 

Consider the iteration in which r is aj^] call this iteration I. By Equation (^), there exists a client 
j G n {j G : j > jo} such that aj^ — d{j,i) > 0; thus, this client participated in a star in an 
iteration prior to i and was connected up. Therefore, it must be the case that Oj < Oj^, which is a 
contradiction to our assumption that jo < j and ai < 02 < Oin^- B 
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Claim 4.5 Let i G F, and ^ W be clients. Then, aj < aji + d{i,j') + d{i,j). 



The proof of this claim closely parallels that of Jain et al. | JMM+03| and is omitted. These two claims 



form the basis for the set up of Jain et al.'s factor-revealing LP. Hence, combining them with Lemmas 



3.4 and 3.6 of Jain et al. |JMM+03| , we have the following lemma: 



Lemma 4.6 The setting a'j = — and f3'-j = max(0, a^- — d{j,i)) is a dual feasible solution, where 
7 = L861. ^ 

An Alternative Proof Without Factor- Revealing LP. We note that a slightly weaker result can 



be derived without the use of factor-revealing LP. Claims |4.5| and can be combined to prove the 
following lemma: 

Lemma 4.7 The setting a'j = aj/3 and j3'-- = max(0, — d{j,i)) is a dual feasible solution. 
Proof: We will show that for each facility i & F, 

{a,-3-d{j,i)) < 3-/,. (4) 

jew. 

Note that if Wi is empty, the lemma is trivially true. Thus, we assume Wi is non-empty and define jo to 
be minWj. Since jo G Wi, d{jQ,i) < aj^ by the definition of Wi. Now let T = {j € Wi : Ojg > d{j,i)}. 
Applying Claims |4.5| and [4.4|, we have 



{oj - d{j, i)) < Y («io + dUo,i)) < Y ' "^o 
jew, jew, jew, 

<2fi + Y^-d{j,i)+ Yl 2-d(j,i) <2/,+ ^ 2-d(i,i), 
jeT jeW,\T jeWi 

which proves inequality (Q). With this, it is easy to see that our choice of /3^j's ensures that all constraints 
of the form aj — fiij < d{j, i) are satisfied. Then, by inequality (Q), we have X^jgc" i^^-xlO, aj — 3-d{j, i)) = 
X^jgVFi ['-'^i ~ ■ '^(j' ^)] — '^■/«' which implies that ^^g,^ max(0, — 3-(i(j, z)) < 3-fi. Hence, we conclude 
that for all facility i ^ F, YljeC l^'ij — fii proving the lemma. I 



Running time analysis Consider the algorithm's description in Algorithm The rows can be 
presorted to give each client its distances from facilities in order. In the original order, each element 
can be marked with its rank. Step 1 then involves a prefix sum on the sorted order to determine how 
far down the order to go and then selection of all facilities at or below that rank. Steps 2-3 require 
reductions and distributions across the rows or columns of the matrix. The subset I C F can be 
represented as a bit mask over F. Step 4 is more interesting to analyze; the following lemma bounds 
the number of rounds facility subselection is executed, the proof of which is analogous to Lemma 4.1.2 



of Rajagopalan and Vazirani [[IV98|; we present here for completeness a simplified version of their proof, 
which suffices for our lemma. 

Lemma 4.8 With probability 1 — o(l), the subselection step terminates within 0(log]^_|,£ m) rounds. 

Proof: Let ^ = \E\. We will show that if ^' is the potential value after an iteration of the subselection 
step, then E[$ — > c<l>, for some constant c > 0. The lemma then follows from standard results in 
probability theory. To proceed, define choserij = \{j £ C : ipj = Furthermore, we say that an edge 
ij is good if at most 6 = |(1 — j^) fraction of neighbors of i have degree higher than j. 
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Consider a good edge ij. We will estimate E [chosen = i]. Since ij is good, we know that 



Yl l{deg(i')<dcg(j)} > (1 - ^) deg{i). 

Therefore, E [chosen jlyPj = i] > ^(1 — 6) deg{i), as it can be shown that Pr[93j/ = i\ipj 
j' € ^Hii) and deg(j') < deg(j). By Markov's inequality and realizing that chosen j < deg(i), we have 

1 



> 



for all 



Pr 



chosen, > 



2(1 + 6 



■ deg(«) 



Po > 0. 



Finally, we note that E[<I> - <!>'] is at least 

i and chosen, > 



ij€E 



2(1 + e; 



■ deg(i) 



deg(j) 



> 



E 



1 



good ij€E 



deg(i) 



Po deg(j) 



> Po E ■'■{^i is good}- 

ijeE 

Since at least 6 fraction of the edges are good, E[$ — <!>'] > poO^. Since ln(l/(l — PqO)) = r2(log(l + e)), 
the lemma follows from standard results in probability [ MR95| ]. ■ 

It is easy to see that each subselection step can be performed with a constant number of basic matrix 
operations over the D matrix. Therefore, if the number of rounds the main body is executed is r, the 
algorithm makes 0{r log^_)_£ m) calls to the basic matrix operations described in Section ^ with probabil- 
ity exceeding 1 — o(l). It also requires a single sort in the preprocessing. This means 0{r log]^^^ m log m) 
time implies a total of 0(rm log^^^ m) work (with probability exceeding 1 — o(l)) on the EREW PRAM. 
Furthermore, it is cache efficient (cache complexity is 0{w/B)) since the sort is only applied once and 
does not dominate the cache bounds. 



Bounding the number of rounds Before describing a less restrictive alternative, we point out that 
the simplest way to bound the number of rounds by a polylogarithm factor is to rely on the common 
assumption that the facility cost, as well as the ratio between the minimum (non-zero) and the maximum 
client-facility distance, is polynomially bounded in the input size. As a result of this assumption, the 
number of rounds is upper-bounded by \ogij^g,{m^) = 0{\ogi_^_^m), for some c > 1. 

Alternatively, we can apply a preprocessing step to ensure that the number of rounds is polylogarithm 
in m. The basic idea of the preprocessing step is that if a star is "relatively cheap," opening it right 
away will harm the approximation factor only slightly. Using the bounds in Equation (0), if Si is the 
lowest-priced maximal star centered at i, we know we can afford to open i and discard all clients attached 
to it if price(5,) < Therefore, the preprocessing step involves: (1) computing Si, the lowest-priced 
maximal star centered at i, for all i F, (2) opening all i such that price(5j) < (3) setting /j of 
these facilities to and removing all clients attached to these facilities. 

Computing 7 takes 0(lognc + lognj) depth and 0{m) work. The rest of the preprocessing step is at 
most as costly as a step in the main body. Thus, the whole preprocessing step can be accomplished in 
O(logm) depth and 0{m) work. With this preprocessing step, three things are clear: First, r in the first 
iteration of the main algorithm will be at least because cheaper stars have already been processed 
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in preprocessing. Second, the cost of our final solution is increased by at most nc x < < opt/m, 
because the facilities and clients handled in preprocessing can be accounted for by the cost of their 
corresponding stars — specifically, there can be most ric stars handled in preprocessing, each of which 
has price < and the price for a star includes both the facility cost and the connection cost of the 

relevant clients and facilities. Finally, in the final iteration, r < ric'j- As a direct consequence of these 
observations, the number of rounds is upper-bounded by logl_^_^{:^J^) < logi^^{m^) = 0{logl_^_^ 
culminating in the following theorem: 



m 



Theorem 4.9 Let < e < 1 be fixed. For sufficiently large input, there is a greedy-style RNC 
0{mlogi^^{m))-work algorithm that yields a factor-{Q + £) approximation for the metric facility-location 
problem. 

5 Facility Location: Primal-Dual 

The primal-dual scheme is a versatile paradigm for combinatorial algorithms design. In the context of fa- 
cility location, this scheme underlies the Lagrangian-multiplier preserving^ (LMP) 3-approximation algo- 
rithm of Jain and Vazirani, enabling them to use the algorithm as a subroutine in their 6-approximation 



algorithm for fc-median | JV01 |. 



The algorithm of Jain and Vazirani consists of two phases, a primal-dual phase and a postprocessing 
phase. To summarize this algorithm, consider the primal and dual programs in Figure ||. In the primal- 
dual phase, starting with all dual variables set to 0, we raise the dual variables ctj's uniformly until a 
constraint of the form aj — < d{j, i) becomes tight, at which point /3jj will also be raised, again, 
uniformly to prevent these constraints from becoming overtight. When a constraint fiij < fi is tight, 
facility i is tentatively opened and clients with ctj > d{j,i) are "frozen," i.e., we stop raising their aj 
values from this point on. The first phase ends when all clients are frozen. In the postprocessing phase, 
we compute and output a maximal independent set on a graph G of tentatively open facilities; in this 
graph, there is an edge between a pair of facilities i and i' if there is a client j such that aj > d{j, i) 
and aj > d{j,i'). Thus, the maximal independent set ensures proper accounting of the facility cost (i.e., 
each client "contributes" to at most one open facility, and every open facility has enough contribution). 
Informally, we say that a client j "pays" for or "contributes" to a facility i if = aj — d{j, i) > 0. 

Remarks. We note that in the parallel setting, the description of the postprocessing step above does 
not directly lead to an efficient algorithm, because constructing G in polylogarithmic depth seems to 
need 0{mnf) work, which is much more than one needs sequentially. 

In this section, we show how to obtain a work-efficient RNC (3 + e)-approximation algorithm for facility 
location, based on the primal-dual algorithm of Jain and Vazirani. Critical to bounding the number of 
iterations in the main algorithm by O(logm) is a preprocessing step, which is similar to that used by 
Pandit and Pemmaraju in their distributed algorithm ]PP09 |. 



Preprocessing: Assuming 7 as defined in Equation (0), we will open every facility i that satisfies 



E 



max ( 0, - z) ) > fi. 



Furthermore, for all clients j such that there exists an opened i and d{j,i) < we declare them 

connected and set aj = 0. The facilities opened in this step will be called free facilities and denoted by 
the set Fq. 



^This means oX^^g^^ fi + Sjgc '^ih^A) < a ■ opt, where a is the approximation ratio. 
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Main Algorithm: The main body of the algorithm is described in Algorithm 5.1. The algorithm 
outputs a bipartite graph H = {Ft,C,E), constructed as the algorithm executes. Here Ft is the set 
of facilities declared open during the iterations of the main algorithm and E is given hy E = {ij : i € 
F,j G C, and (1 + e)aj > d{j,i)}. 

Algorithm 5.1 Parallel primal-dual algorithm for metric facility location 

For iteration £ = 0, 1, ... , the algorithm performs the following steps until all facilities are opened or all 
clients are frozen, whichever happens first. 

1. For each unfrozen client j, in parallel, set aj to ^(1 + e)^- 

2. For each unopened facility i, in parallel, declare it open if 

^ max(0, (1 + e)aj - d{j, i)) > ft. 

3. For each unfrozen client j, in parallel, freeze this client if there exists an opened facility i such 
that (1 + e)aj > d{j, i). 

4. Update the graph H by adding edges between pairs of nodes ij such that (1 + e)aj > d{j,i). 

After the last iteration, if all facilities are opened but some clients are not yet frozen, we determine in 
parallel the aj settings of these clients that will make them reach an open facility (i.e., aj = miuj d{j, i)). 
Finally, update the graph H as necessary. 

Post-processing. As a post-processing step, we compute / = MAxUDOM(if). Thus, the set of 
facilities / C Ft has the property that each client contributes to the cost of at most one facility in /. 
Finally, we report F^ = I U Fq as the set of facilities in the final solution. 

5.1 Analysis 

To analyze approximation guarantee of this algorithm, we start by establishing that the aj setting 
produced by the algorithm leads to a dual feasible solution. 

Claim 5.1 For any facility i, 

^ max(0, Oj - d{j, i)) < fi. 

Proof: Let denote the aj value at the end of iteration I. Suppose for a contradiction that there 
is a facility i which is overtight. More formally, there exists i £ F and the smallest i such that 
SjGrF(i) '^j'^^ — d{j,i)) > fi- Let J be the set of unfrozen neighboring clients of i in iteration 

i — The reason facility i was not opened in iteration i—1 and the surrounding clients were not frozen 
is 

raisedi ^= max(0, (1 -|- £:)o(^^ — d{j, i)) + max(0, (1 -|- — d{j, i)) < fi. 

jerpii)\j j&J 

However, we know that ti = {1 + e)ti-i, and for each frozen neighboring client j (i.e., j € ^F{i) \ J), 
{£) (e-i) 
j ' ^° 

raisedi > max(0, a^j'^ — d{j, i)) + max(0, t£ — d{j, i)) = max(0, a^p — d{j, i)), 

jer{i)\j jeJ jer(i) 
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which is a contradiction. 



It fohows from this claim that setting /3jj = max(0, — d{j,i)) provides a dual feasible solution. Next 
we relate the cost of our solution to the cost of the dual solution. To ease the following analyses, we use 
a client-to-facility assignment vr : C — > F, defined as follows: For all j € C, let ip{j) = {i : (1 + e)aj > 
d{j,i)}. Now for each client j, (1) if there exists i G Fq such that d{j,i) < set TTj to any such i; 

(2) if there exists i € I such that ij is an edge in H, then nj = i {i is unique because of properties of 
/) ; (3) if there exists i € / such that i G then iTj = i; (4) otherwise, pick i' G ip{j) and set ttj to 
i G / which is a neighbor of a neighbor of i' . 

Clients of the first case, denoted by Co, are called freely connected; clients of the cases (2) and (3), 
denoted by Ci, are called directly connected. Otherwise, a client is indirectly connected. 

The following lemmas bound the facility costs and the connection costs of indirectly connected clients. 
Lemma 5.2 

Proof: When facility i & Ft was opened, it must satisfy fi < J2j-ij£E{G)('^~''~^)^J i)- If client j has 
contributed to i (i.e., (1 + e)aj — d{j,i) > 0) and i € I, then j is directly connected to it. Furthermore, 
for each client j, there is at most one facility in / that it contributes to (because / = MaxUDom(//)). 
Therefore, Yli^i fi — J2jeCi(^ + e)aj — d{j,7rj). Furthermore, for each "free" facility, we know that 
fi < Ejec™a^(0'^^M^ ~ d{j,i)), so by our choice of vr, < x - EjeCo:7r,=i '^(•?' "^^^^^ 
X^jgFo fi < 7/m — Z^jgCQ d{j,i). Combining these results and observing that Fa is the disjoint union 
of / and Fq, we have the lemma. ■ 

Lemma 5.3 For each indirectly connected client j (i.e., j ^ CqL) Ci), we have d{j,TTj) < 3(1 + e)aj. 

Proof: Because j ^ CqU Ci and / = MAxUDOM(i7), there must exist a facility i' G ^{j) and a client 
j' such that j' contributed to both i and i', and (1 + e)aj > d{j,i'). We claim that both d{j',i') and 
d{j',i) are upper-bounded by (1 + e)aj. To see this, we note that because j' contributed to both i 
and i', d{j',i') < (1 + e)aj' and d{j',i) < (1 + e)aj'. Let £ be the iteration in which j was declared 
frozen, so aj = t£. Since i' G ^{j), i' must be declared open in iteration < £. Furthermore, because 
(1 + £)aji > d{j',i'), Oji must be frozen in or prior to iteration I. Consequently, we have aj' <ti = aj. 
Combining these facts and applying the triangle inequality, we get d{j, i) < d{j, i') + d{i' + d{j' ,i) < 
(1 + £)aj + 2(1 + £)aj' < 3(1 + e] 



a 



J- 



By Lemmas and |5.3| , we establish 



^E^^ + E'^^-^'^i) ^ — + 3(l + e) J^a,. (5) 

ieFA j€C ^ j&c 



Now since {oj, /3jj} is dual feasible, its value can be at most that of the primal optimal solution; that is, 
X^j 01 j < opt. Therefore, combining with Equation (|5|), we know that the cost of the solution returned 
by parallel primal-dual algorithm in this section is at most 3 J^ieP^ fi + SjeC ^(^^ C") < (3 + e')opt for 
some e' > when the problem instance is large enough. 
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Running Time Analysis We analyze the running of the algorithm presented, starting with the main 
body of the algorithm. Since aj < opt and opt < nc7, no aj can be bigger than ricj < mj. Hence, 
the main algorithm must terminate before i > 31og]^_|_j m, which upper-bounds the number of iterations 
to 0{logij_^m). In each iteration, steps 1, 3, and 4 perform trivial work. Step 2 can be broken down 
into (1) computing the max for all i F,j E C,, and (2) computing the sum for each i £ F. These 
can all be implemented with the basic matrix operations, giving a totalof 0{logi_^^ m) of basic matrix 
operations over a matrix of size m. 

The preprocessing step, again, involves some reductions over the rows and columns of the matrix. This 
includes the calculations of 7j's and the composite 7. The post-processing step relies on computing the 
[/-dominating set, as described in Section ^ which runs in O(logm) matrix operations. 

The whole algorithm therefore runs in 0(logi_|_£ jn) basic matrix operations and is hence work efficient 
compared to the 0(m log m) sequential algorithm of Jain and Vazirani. Putting these altogether, we 
have the following theorem: 

Theorem 5.4 Let e > be fixed. For sufficiently large m, there is a primal-dual RNC 0{mlogi^^ m)- 
work algorithm that yields a factor-{2> + e) approximation for the metric facility-location problem. 

6 Other Results 

In this section, we consider other applications of dominator set in facility-location problems. 
6.1 /c-Center 

Hochbaum and Shmoys [[IS85| show a simple factor-2 approximation for /c-center. The algorithm per- 



forms a binary search on the range of distances. We show how to combine the dominator-set algorithm 
from Section ^ with standard techniques to parallelize the algorithm of Hochbaum and Shmoys, re- 
sulting in an RNC algorithm with the same approximation guarantee. Consider the set of distances 
V = {d{i,j) : i € C and j € V} and order them so that di < d2 < • • • < dp and {di, . . . , dp} = V, where 
p = |D|. The sequence {di}^^^ can be computed in 0(log|l^|) depth and 0(|yp log work. Let Ha 
be a graph defined as follows: the nodes of is the set of nodes V, but there is an edge connecting i 
and j if and only if d{i,j) < a. 

The main idea of the algorithm is simple: find the smallest index t G {1, 2, . . . , p} such that MAxDoM(i?dj) < 
k. Hochbaum and Shmoys observe that the value t can be found using binary search in O(logp) = 
0(log |y|) probes. We parallelize the probe step, consisting of constructing H^^, for a given t' € 
{!,..., p} and checking whether |MAxDOM(//rf^, )| is bigger than k. Constructing H^^, takes 0(1) 
depth and 0(|yp) work, and using the maximal-dominator-set algorithm from Section ^, the test can 
be performed in expected 0(log^ |y|) depth and expected 0(|yplog|y|) work. The approximation 
factor is identical to the original algorithm, hence proving the following theorem: 

Theorem 6.1 There is an RNC 2 -approximation algorithm with 0{{\V\log\V\)'^) work for k-center. 
6.2 Facility Location: LP Rounding 

LP rounding was among the very first techniques that yield non-trivial approximation guarantees for 
metric facility location. The first constant-approximation algorithm was given by Shmoys et al. [pTA97 |. 



Although we do not know how to solve the linear program for facility location in polylogarithmic depth, 
we demonstrate another application of the dominator-set algorithm and the slack idea by parallelizing 
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the randomized-rounding step of Shmoys et al. The algorithm yields a (4 + e)-approximation, and the 
randomized romiding is an RNC algorithm. 

The randomized rounding algorithm of Shmoys et al. consists of two phases: a filtering phase and a 
rounding phase. In the following, we show how to parallelize these phases and prove that the parallel 
version has a similar guarantee. Our presentation differs slightly from the original work but works in 
the same spirit. 

Filtering: The filtering phase is naturally parallelizable. Fix a to be a value between and 1. Given an 
optimal primal solution {x,y), the goal of this step is to produce a new solution {x',y') with properties 
as detailed in Lemma |6.2| . Let 5j = XlieF ^(^'•?') ' Bj = {i £ F : d{i,j) < (1 + a)Sj}, and 
ieB^ ^ij- compute x[j and y'^ as follows: (1) let x[j — Xij/vna5s{Bj) if i G Bj or 
otherwise, and (2) let y'^ = min(l, (1 + l/a)yi). 

Lemma 6.2 Given an optimal primal solution {x,y), there is a primal feasible solution {x',y') such 
that (1) 4- = 1, (2) if x[j > 0, then d{j, t) < {1 + a)6j, and (3) Zi hVi < (1 + ^) E^ hVv 

Proof: By construction, (1) clearly holds. Furthermore, we know that if x[j > 0, it must be the case 
that i G Bj, so d{j, i) < {l + a)6j, proving (2). By definition of y-, hVi < (1 + ^) Ei hVi^ proving (3). 
Finally, since in an optimal LP solution, '^^i^ij ~ know that mass(i?j) > j:^, by an averaging 

argument. Therefore, x-^ < (1 + \)xij < min(l, (1 + ^)yi) = y[, showing that {x',y') is primal feasible. 
■ 

Rounding: The rounding phase is more challenging to parallelize because it is inherently sequential — 
a greedy algorithm which considers the clients in an increasing order of 5j and appears to need r2(nc) 
steps. We show, however, that we can achieve parallelism by eagerly processing the clients S = {j : 
6j < (1 + £)t}. This is followed by a clean-up step, which uses the dominator-set algorithm to rectify 
the excess facilities. We precompute the following information: (1) for each j, let ij be the least costly 
facility in Bj, and (2) construct H = (C, F, ij e E \S. i e Bj). 

There is a preprocessing step to ensure that the number of rounds is polylogarithmic in m. Let 9 be the 
value of the optimal LP solution. By an argument similar to that of Section ^, we can afford to process 
all clients with Sj < Qjrn} in the first round, increasing the final cost by at most Qjm < opt/m. The 
algorithm then proceeds in rounds, each performing the following steps: 



1. 


Let T 


= miuj 6j . 


2. 


Let S 


= {j ■■ Sj < (1 + e)T} and 


3. 


Let J 


= MAxUDOM(f/'), add / = {ij : j £ J} to Fa; finally, remove aU of S and Uj^sBj from 




V{H). 





Since J is [/-dominator of H, we know that for all distinct G J, Bj fl Bj' = 0; therefore, X^jg/ fi = 
Y^jejfij < Ejg,7 (EiGBj ^ijfij) ^ ^j&jy'ifij < Ej&jy'ifi, proving the following claim: 

Claim 6.3 In each round, Eie/ fi < ^ieu.Bj Vifi- 

Like our previous analyses, we will define a client-to- facility assignment vr convenient for the proof. For 
each j G C, if ij G F^, let TTj = ij; otherwise, set vTj = ij/, where j' is the client that causes ij to be shut 
down (i.e., either ij G Bj/ and j' was process in a previous iteration, or both j and j' are processed in 
the same iteration but there exists i £ Bj Ci Bji). 
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Claim 6.4 Let j be a client. Ifij € Fa, then d{j,7rj) < {l + a)5j; otherwise, d{j,TTj) < 3(l + a)(l + e)5j. 



Proof: If ij € Fa, then by Lemma 6^, d{j,7Tj) < (1 + a)6j. If ij Fa, we know that there must 



exist i G Bj and j' such that i € -Bj' and 5^' < (1 + £)6j. Thus, applying Lemma 6.2 and the triangle 
inequality, we have d{j, ttj) < d{j, i) + d{i,j') + d{j' , iji) < 3(1 + a)(l + s)5j. ■ 

Running Time Analysis: The above algorithm will terminate in at most 0{logi_^_^m) rounds be- 
cause the preprocessing step ensures the ratio between the maximum and the minimum 5j values are 
polynomially bounded. Like previous analyses, steps 1-2 can be accomplished in 0(1) basic matrix 
operations, and step 3 in O(logm) basic matrix operations on matrices of size m. This yields a total of 
0(log^^g mlogm) basic matrix operations, proving the following theorem: 

Theorem 6.5 Given an optimal LP solution for the primal LP in Figure there is an RNC rounding 
algorithm yielding a (A + e)- approximation with 0{m\ogmlogij_^m) work. It is cache efficient. 

7 fc-Median: Local Search 

Local search, LP rounding, and Lagrangian relaxation are among the main techniques for approximation 
algorithms for fe-median. In this section, building on the algorithms from previous sections, we present 
an algorithm for the /c-median problem, based on local-search techniques. The natural local-search 
algorithm for /c-median is very simple: starting with any set Fa of k facilities, find some i S Fa and 
i' € -F \ Fa such that swapping them decreases the A:-median cost, and repeat until no such moves 
can be found. Finding an improving swap or identifying that none exists takes 0{k{n — k)n) time 
sequentially, where n is the number of nodes in the instance. This algorithm is known to be a 5- 
approximation HAGK-^04 |GT08|. 



The key ideas in this section are that we can find a good initial solution 5*0 quickly and perform each 
local-search step fast. Together, this means that only a small number of local-search steps is needed, 
and each step can be performed fast. To find a good initial solution, we observe that any optimal 
fc-center solution is an n-approximation for A;-median. Therefore, we will use the 2-approximation from 



Section 3^ as a factor- (2n) solution for the /c-median problem. At the beginning of the algorithm, for 
each j € V, we order the facilities by their distance from j, taking 0{n? logn) work and O(logn) depth. 

Let < e < 1 be fixed. We say that a swap {i, i') such that i € Fa and i' £ F \ Fa is improving if 
kMed(F5 — i + i') < (1 — f3 /k)KMED (Fs), where (3 = e/(l + e). The parallel algorithm proceeds as 
follows. In each round, find and apply an improving swap as long as there is one. We now describe how 
to perform each local-search step fast. During the execution, the algorithm keeps track of ipj, the facility 
client j is assigned to, for all j G V. We will consider all possible test swaps i G Fa and i' £ V \ Fa 
simultaneously in parallel. For each potential swap {i,i'), every client can independently compute = 
d{j. Fa — i + i') — d{j. Fa)', this computation trivially takes 0{nc) work and 0(1) depth, since we know 
ipj and the distances are presorted. From here, we know that kM.ed{Fa — i + i') — kM.ed{Fa) = Aj, 
which can be computed in 0(n) work and O(logn) depth. Therefore, in 0{k{n — k)n) work and O(logn) 
depth, we can find an improving swap or detect that none exists. Finally, a round concludes by applying 
an improving swap to Fa and updating the (pj values. 



Arya et al. [AGK_04| show that the number of rounds is bounded by 

O (logi/(i_^/fc) (KMED(5o)/opt)) = O (logi/(i_^/fc)(n) 



Since for < e < 1, In (1/(1 — (3/k)) < | In (1/(1 — (3)), we have the following theorem, assuming 
k G 0(polylog(n)), which is often the case in many applications: 
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Theorem 7.1 For k £ 0(polylog(n)), there is an NC 0{k'^{n — k)n\ogij^i,{n)) -work algorithm which 
gives a factor-{5 + e) approximation for k-median. 



Remarks. Relative to the sequential algorithm, this algorithm is work efficient — regardless of the 
range of k. In addition to fc-median, this approach is applicable to fc-means, yielding an (81 + e)- 
approximation [GT08] in general metric spaces and a (25+e)-approximation for the Euclidean space [[KMN+Ol ], 
and the same parallelization techniques can be used to achieve the same running time. Furthermore, 
there is a factor-3 approximation local-search algorithm for facility location, in which a similar idea 
can be used to perform each local-search step efficiently; however, we do not know how to bound the 
number of rounds. 



8 Conclusion 

This paper studies the design and analysis of parallel approximation algorithms for facility-location 
problems, including facility location, A:-center, /c-median, and /c-means. We presented several efficient 
algorithms, based on a diverse set of approximation algorithms techniques. The practicality of these 
algorithms is a matter pending experimental investigation. 
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